---
title: "Third party data and basic text mining"
output:
  html_notebook:
    toc: true
    toc_float: true
---

# The general idea

Data transfer is highly controlled. The key notions are **authentication** and **protocol**.

# Downloading tweets with *rtweet*

Several packages provide an interface with Twitter: *rtweet*, *RTwitterAPI*, *streamR* and *twitteR*.   
Recent packages are preferable: firms regularly update their API policies (and access rules), so older protocols sometimes stop working!

## First things first
**First**, the packages. Download...

```{r, warning = FALSE, message = FALSE}
if(!require(rtweet)){install.packages("rtweet")}
```

... and activate.

```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(plotly)
library(rtweet)
```

## Authentication

**Second**: you need Twitter credentials (and therefore a Twitter account).
Log in to Twitter and go to: https://developer.twitter.com 

![](twitter1.png)

The next step is crucial: we need to retrieve identification credentials.   
In order to do that, you need to create a Twitter app. Below, you can see mine. 
To create one, simply click on the "Create an app" button (on the right).

![](twitter2.png)

If you click on the "details" of an app, you can see this:

![](twitter3.png)

The second tab is called "**Keys and tokens**" $\rightarrow$ that's where the info is!!!

![](twitter4.png)


Now we are ready to proceed. The lines below open the connection with the API.

```{r, warning = FALSE, message = FALSE, eval = FALSE}
consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

create_token(app = "the_name_of_your_app",
             consumer_key = consumer_key, 
             consumer_secret = consumer_secret, 
             access_token = access_token, 
             access_secret = access_secret
             )
```



```{r, warning = FALSE, message = FALSE, echo = FALSE}
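# Credentials are read from environment variables (e.g., set in ~/.Renviron) so that
# they never appear in the notebook source; the variable names below are illustrative.
consumer_key <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret <- Sys.getenv("TWITTER_ACCESS_SECRET")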

create_token(app = "Big Doudou",
             consumer_key = consumer_key, 
             consumer_secret = consumer_secret, 
             access_token = access_token, 
             access_secret = access_secret
             )
```

Authentication is an important part of the process. For more info on that:  
- https://cran.r-project.org/web/packages/googlesheets/vignettes/managing-auth-tokens.html   
- https://httr.r-lib.org/reference/index.html (section Authentication)

## Extraction
If no error appears, we are ready to query. Depending on the number of requested tweets, this can take some time.

```{r, message = FALSE, warning = FALSE}
search_term <- "harris"
tweets <- search_tweets(
  search_term,          # What to search for
  n = 2000,             # Number of tweets to download
  include_rts = FALSE   # Exclude re-tweets
)
```
For large queries, the progress bar helps.   
Note that many options are available, such as excluding retweets (as above) or limiting the search to particular geographical zones (within a given radius).

# Text mining

## References
The reference book is: https://www.tidytextmining.com      
And the package is:

```{r, message = FALSE, warning = FALSE}
if(!require(tidytext)){install.packages("tidytext", repos = "https://cloud.r-project.org/")}
library(tidytext)
```


## Data retrieval

Now, let's move forward to simple text analysis. First, we need to prepare the data! (as usual)

```{r, warning = FALSE, message = FALSE}
tokens <- tweets %>% 
    mutate(id = 1:nrow(tweets)) %>%  # This creates a tweet id
    select(id, text) %>%             # Keeps only id and text of the tweet
    unnest_tokens(word, text)        # Creates tokens!
tokens
```
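
To see what the tokenization step does, here is a minimal sketch on two made-up tweets (the `toy_tweets` data is purely illustrative and not part of the sample). It assumes the default behavior of `unnest_tokens()`, which lowercases the text and drops punctuation.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(tidytext)

# Two made-up "tweets" (illustrative data, not from the downloaded sample)
toy_tweets <- tibble(
  id   = 1:2,
  text = c("Text mining is fun!", "Mining TEXT, again.")
)

# One row per word: tokens are lowercased and punctuation is dropped
toy_tokens <- toy_tweets %>%
  unnest_tokens(word, text)
toy_tokens
```

Each word keeps the `id` of the tweet it comes from, which is what makes later joins and counts by tweet possible.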

Let's have a look at word frequencies.

```{r, warning = FALSE, message = FALSE}
tokens %>%
    count(word, sort = TRUE)
```

This is polluted by small words. Let's filter that (*FIRST METHOD*).

```{r, warning = FALSE, message = FALSE}
tokens %>% mutate(length = nchar(word))
```


## Data frequencies
Now let's omit the small words (shorter than 5 characters).   
**NOTE**: all the thresholds below depend on the sample! 

```{r, warning = FALSE, message = FALSE}
tokens %>%
    mutate(length = nchar(word)) %>%
    filter(length > 4) %>%             # Keep words with more than 4 characters
    count(word, sort = TRUE) %>%       # Count words
    head(18) %>%                       # Keep only top 18 words
    ggplot(aes(x = reorder(word,n), y = n)) + geom_col() + coord_flip() +
      xlab("Words")
```

A better way to proceed is to remove "stop words" like "a", "I", "of", etc. (*SECOND METHOD*).
Also, it would make sense to remove the search item and "https".

```{r, warning = FALSE, message = FALSE}
data("stop_words")
tidy_tokens <- tokens %>% 
    anti_join(stop_words)                    # Remove irrelevant terms
tidy_tokens %>%
    count(word, sort = TRUE) %>%             # Count words
    head(20) %>%                             # Keep only top 20 words
    ggplot(aes(x = reorder(word,n), y = n)) + geom_col() + coord_flip() +
      xlab("Words")
```
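
The stop-word removal is an anti-join: rows whose `word` appears in the lexicon are dropped. A small sketch on a hand-made token list (the `toy_tokens` words are illustrative) makes this visible; the join key is named explicitly here, which is optional.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(tidytext)

# Hand-made tokens mixing stop words and content words (illustrative)
toy_tokens <- tibble(word = c("the", "election", "of", "senator", "a", "debate"))

# stop_words ships with tidytext; rows matching it on "word" are removed
toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word")
toy_clean
```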

**Problem**: strange characters remain. We are going to remove them by converting the text to ASCII format and omitting the resulting *NA* values. 

```{r, warning = FALSE, message = FALSE}
tidy_tokens <- tokens %>% 
    anti_join(stop_words) %>%                            # Remove irrelevant terms
    mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII (failures become NA)
    na.omit() %>%                                        # Remove missing
    filter(nchar(word) > 3,                              # Remove small words
           !(word %in% c("https", "t.co", search_term))  # search_term defined above
    )
tidy_tokens %>%
    count(word, sort = TRUE) %>%         # Count words
    head(15) %>%                         # Keep only top words
    ggplot(aes(x = reorder(word,n), y = n)) + geom_col() + coord_flip() +
  xlab("Words")
```
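
As a possible alternative to the `iconv()` conversion, non-ASCII tokens can be detected directly. The sketch below is not the method used above: it assumes the *stringi* package (installed with the tidyverse) and its `stri_enc_isascii()` function, and the `toy_tokens` data is made up.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(stringi)

# Made-up tokens: two plain words, one accented word, one emoji
toy_tokens <- tibble(word = c("harris", "caf\u00e9", "\U0001F600", "vote"))

# Keep only the tokens made entirely of ASCII characters
toy_ascii <- toy_tokens %>%
  filter(stri_enc_isascii(word))
toy_ascii
```

This filter drops accented words as well as emojis, so it is only suitable when the analysis targets plain English text.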

Perfect!

## Word cloud

This data can also be shown with a word cloud. We simply use the *wordcloud* package: https://cran.r-project.org/web/packages/wordcloud/index.html 

The package *wordcloud2* adds a few features: https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html

```{r, warning = FALSE, message = FALSE}
if(!require(wordcloud)){install.packages("wordcloud")}
library(wordcloud)
cloud_data <- tidy_tokens %>% count(word)
wordcloud(words = cloud_data$word, 
          freq = cloud_data$n, min.freq = 2,
          max.words = 100, random.order = FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
```

## Sentiment

This section is inspired by: https://www.tidytextmining.com/sentiment.html    
Sometimes, you may be asked in the process if you *really* want to download data (lexicons).  
Just say yes in the **console** (type the correct answer: if not, you will be stuck).

First, we need to load some sentiment lexicon. AFINN is one such sentiment database. 

```{r}
if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn
```

To create a nice visualization, we need to extract the **time** of the tweets.

```{r}
tokens_time <- tweets %>% 
    mutate(id = 1:nrow(tweets)) %>%    # This creates a tweet id
    select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
    unnest_tokens(word, text)          # Creates tokens!
tokens_time
```

We then use **inner_join**() to merge the two sets. This function drops the rows for which no match is found, i.e., the words that are absent from the lexicon.

```{r}
library(lubridate)
sentiment <- tokens_time %>% 
  inner_join(afinn) %>%
  mutate(day = day(created_at),
         hour = hour(created_at) / 24,
         minute = minute(created_at) / 60 / 24,
         time = day + hour + minute)
sentiment
```
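
The effect of the inner join can be checked on a tiny example. The lexicon below is hand-made with hypothetical scores (these are NOT the real AFINN values), and the tokens are illustrative.

```{r, warning = FALSE, message = FALSE}
library(dplyr)

# Hand-made mini lexicon (hypothetical scores, not the real AFINN values)
toy_lexicon <- tibble(word = c("happy", "sad"), value = c(3, -2))

# Tokens from two made-up tweets
toy_tokens <- tibble(
  id   = c(1, 1, 1, 2, 2),
  word = c("so", "happy", "today", "very", "sad")
)

# Only the words found in both tables survive the join
toy_scored <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word")
toy_scored
```

Five tokens go in and two come out: words without a score ("so", "today", "very") disappear, which is why the merged dataset is smaller than the token table.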

We then compute the average sentiment, minute-by-minute.   
Of course, average sentiment can be misleading. Indeed, if a text contains the terms "*I'm not happy*", then only "*happy*" will be tagged, which is the opposite of the intended meaning.

```{r}
sentiment %>%
  group_by(time, day, hour, minute) %>%
  summarise(avg_sentiment = mean(value)) %>%
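  # round() guards against floating-point error: make_datetime() works with integer hours and minutes
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = round(hour*24), min = round(minute*24*60))) %>%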
  ggplot(aes(x = time, y = avg_sentiment)) + geom_col()
```
There are 24 bars per day, but the *y*-axis is not optimal...  
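
If a coarser time scale is preferred, one possible approach is to truncate each time stamp to the hour before averaging. The sketch below assumes `floor_date()` from *lubridate* and uses made-up data (`toy_sentiment`); it is an illustration, not the aggregation used above.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(lubridate)

# Made-up scores with time stamps (illustrative)
toy_sentiment <- tibble(
  created_at = ymd_hms(c("2020-10-07 09:05:00", "2020-10-07 09:40:00",
                         "2020-10-07 10:15:00"), tz = "UTC"),
  value = c(2, -4, 3)
)

# Truncate each time stamp to the hour, then average within each hour
toy_hourly <- toy_sentiment %>%
  mutate(hour_slot = floor_date(created_at, unit = "hour")) %>%
  group_by(hour_slot) %>%
  summarise(avg_sentiment = mean(value), .groups = "drop")
toy_hourly
```

Working on the date-time column directly also avoids hard-coding the year and month when rebuilding the time axis.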
 
What about emotions? The **NRC** lexicon assigns words to emotion categories. Below, we order these categories so that the positive emotions come first and the negative ones last: the main distinction is the dichotomy between positive and negative emotions. 

```{r, message = FALSE, warning = FALSE}
nrc <- get_sentiments("nrc")
nrc <- nrc %>%
  mutate(sentiment = as.factor(sentiment),
         sentiment = recode_factor(sentiment,
                                   joy = "joy",
                                   trust = "trust",
                                   surprise = "surprise",
                                   anticipation = "anticipation",
                                   positive = "positive",
                                   negative = "negative",
                                   sadness = "sadness",
                                   anger = "anger",
                                   fear = "fear",
                                   disgust = "disgust",
                                   .ordered = TRUE))
```
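
The same ordering idea can be sketched with base R only, by passing the desired level order to `factor()`. The vector `toy_emotions` is made up, and this is an alternative illustration rather than the code used above.

```{r}
# Desired order: positive emotions first, negative ones last
emotion_levels <- c("joy", "trust", "surprise", "anticipation", "positive",
                    "negative", "sadness", "anger", "fear", "disgust")

# Made-up emotion labels (illustrative)
toy_emotions <- c("fear", "joy", "negative", "trust")
toy_factor <- factor(toy_emotions, levels = emotion_levels, ordered = TRUE)

levels(toy_factor)   # Levels follow emotion_levels, not the alphabet
sort(toy_factor)     # Values are sorted along that order
```

The level order is what *ggplot2* uses to stack and color the categories in the plots.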

We then create the merged dataset.

```{r}
emotions <- tokens_time %>% 
  inner_join(nrc) %>%             # Merge data with sentiment
  mutate(day = day(created_at),
         hour = hour(created_at)/24,
         minute = minute(created_at)/24/60,
         time = day+hour+minute)   # Time index, in fractional days
emotions                          # Show the result
```

The merging has reduced the size of the dataset, but enough data remains to pursue the study.   
Finally, we move to the pivot table that counts emotions at each time stamp.

```{r}
g <- emotions %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n()) %>%
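  # round() guards against floating-point error: make_datetime() works with integer hours and minutes
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = round(hour*24), min = round(minute*24*60))) %>%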
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col() + 
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time")
ggplotly(g)
```

This can also be shown in percentage format. 

```{r}
g <- emotions %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n()) %>%
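  # round() guards against floating-point error: make_datetime() works with integer hours and minutes
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = round(hour*24), min = round(minute*24*60))) %>%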
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time")
ggplotly(g)
```
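
The option `position = "fill"` rescales each bar so that the categories sum to one. The shares can also be computed explicitly, as in the sketch below on made-up counts (`toy_emotions` is illustrative and groups by day for simplicity).

```{r, warning = FALSE, message = FALSE}
library(dplyr)

# Made-up emotion tags over two days (illustrative)
toy_emotions <- tibble(
  day = c(1, 1, 1, 1, 2, 2),
  sentiment = c("joy", "joy", "fear", "anger", "joy", "fear")
)

# Count each emotion per day, then divide by the daily total
toy_shares <- toy_emotions %>%
  count(day, sentiment, name = "intensity") %>%
  group_by(day) %>%
  mutate(share = intensity / sum(intensity)) %>%
  ungroup()
toy_shares
```

Having the shares in a column is handy when they need to be tabulated or exported, not only plotted.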

Going further would probably involve *n-grams*, see https://www.tidytextmining.com/ngrams.html
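
As a first step in that direction, here is a minimal sketch of bigram extraction with `unnest_tokens()` and its `token = "ngrams"` option. The two tweets are made up; the point is that a pair like "not happy" is kept together instead of being split into two unrelated words.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(tidytext)

# Two made-up tweets containing a negation (illustrative)
toy_tweets <- tibble(id = 1:2,
                     text = c("I am not happy", "Really not happy today"))

# One row per pair of consecutive words
toy_bigrams <- toy_tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
toy_bigrams

# Negated expressions can now be located directly
toy_bigrams %>% filter(bigram == "not happy")
```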


## Advanced sentiment 

The problem with the preceding methods is that they don't take into account **valence shifters** (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions). If a tweet says *not happy*, counting the word *happy* is not a good idea! The package *sentimentr* is built to circumvent these issues: have a look at https://github.com/trinker/sentimentr  
(see also: https://www.sentometrics.org)

```{r, warning = FALSE, message = FALSE}
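# Check each package separately, so that textcat is installed even if sentimentr already is
for (pkg in c("sentimentr", "textcat")) {
  if (!require(pkg, character.only = TRUE)) install.packages(pkg)
}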
library(sentimentr)
library(textcat)
```

First, let's keep only the tweets written in English!

```{r}
tweets_en <- tweets %>%
  mutate(language = textcat(text)) %>%
  filter(language == "english") %>%
  dplyr::select(created_at, text)
```

**NOTE**: the code above is only meant to illustrate the function *textcat*. The language is already coded in the tweets via the **lang** column/variable: it suffices to keep the instances for which lang == "en".
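
A sketch of that simpler filter, on made-up rows that mimic the `created_at`, `text` and `lang` columns (the `toy_tweets` data is illustrative):

```{r, warning = FALSE, message = FALSE}
library(dplyr)

# Made-up rows with the same column names as the downloaded tweets
toy_tweets <- tibble(
  created_at = as.POSIXct(c("2020-10-07 09:00:00", "2020-10-07 09:05:00",
                            "2020-10-07 09:10:00"), tz = "UTC"),
  text = c("Great debate tonight", "Bonsoir tout le monde", "Not impressed"),
  lang = c("en", "fr", "en")
)

# Keep the tweets tagged as English, then the two columns of interest
toy_en <- toy_tweets %>%
  filter(lang == "en") %>%
  dplyr::select(created_at, text)
toy_en
```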

Next, we compute advanced sentiment. 

```{r}
tweet_sent <- tweets_en$text %>%
  get_sentences() %>%  # Intermediate function
  sentiment()          # Sentiment!
tweet_sent
```
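
To check that negators are indeed taken into account, the same two functions can be run on a pair of made-up sentences. With the default settings of *sentimentr*, the two scores are expected to have opposite signs; the exact values depend on the package version and its lexicon.

```{r, warning = FALSE, message = FALSE}
library(sentimentr)

# Two made-up sentences that differ only by the negator "not"
toy_text <- c("I am happy.", "I am not happy.")

# One row per sentence, with element_id pointing to the position in toy_text
toy_sent <- sentiment(get_sentences(toy_text))
toy_sent
```

A word-by-word lexicon match would give both sentences the same score, since "happy" appears in each.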

**NOTE**: the right time scale depends on how frequently tweets are posted. Daily or hourly averages are usually more robust; if the search term is very popular, there are enough tweets to work at higher frequencies. 

```{r}
tweets_en %>%
  rowid_to_column("element_id") # This creates a new column with row number

tweets_en %>%
  rowid_to_column("element_id") %>%
  left_join(tweet_sent, by = "element_id")

tweets_en %>%
  rowid_to_column("element_id") %>%
  left_join(tweet_sent, by = "element_id") %>%
  group_by(day = day(created_at)) %>%
  summarise(avg_sent = mean(sentiment)) %>%
  ggplot(aes(x = as.factor(day), y = avg_sent)) + geom_col() + xlab("day")

tweets_en %>%
  rowid_to_column("element_id") %>%
  left_join(tweet_sent, by = "element_id") %>%
  ggplot(aes(x = as.factor(day(created_at)), y = sentiment)) + 
  geom_jitter(size = 0.2) +
  geom_boxplot(aes(color = as.factor(day(created_at))), alpha = 0.5) +
  theme(legend.position = "none") + xlab("day")
```



# Resources

Below, a short list of resources (to access third-party data):   

- **text mining with R** (online book): https://www.tidytextmining.com      
- **Bloomberg**: https://cran.r-project.org/web/packages/Rblpapi/index.html   
- **gmail**: https://cran.r-project.org/web/packages/gmailr/vignettes/gmailr.html   
- **Google Maps**: https://cran.rstudio.com/web/packages/mapsapi/vignettes/intro.html  
- **Google trends**: https://github.com/PMassicotte/gtrendsR
- **Google APIs** (more generally): https://cran.r-project.org/web/packages/gargle/vignettes/auth-from-web.html


Possibly deprecated:  
- **Facebook**: https://cran.r-project.org/web/packages/Rfacebook/index.html    
- **Instagram**: https://cran.r-project.org/web/packages/instaR/index.html

